March 2023
AIM: Given a reference sequence and a set of short reads, align each read to the reference to find its most likely origin.
Commonly used aligners: STAR, HISAT2
switch to quasi-mapping (Salmon) or pseudo-alignment (Kallisto)
These tools avoid base-to-base alignment of the reads
~20 times faster than traditional alignment tools such as STAR and HISAT2
Unlike alignment-based methods, pseudo-alignment methods focus on the transcriptome (~2% of the genome in human)
Use exact k-mer matching rather than aligning whole reads with mismatches and indels
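As a toy illustration of exact k-mer matching, the sketch below splits a read into k-mers and counts how many occur verbatim in a transcript sequence. It is not how Salmon or Kallisto are implemented (they rely on purpose-built indexes); the function names and sequences are hypothetical.

```sh
# Toy sketch only: real tools look k-mers up in a prebuilt index, not by scanning.

# Print every k-mer of a sequence, one per line.
# $1 = sequence, $2 = k
kmers() {
    awk -v seq="$1" -v k="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(seq); i++) print substr(seq, i, k)
    }'
}

# Count how many k-mers of a read occur exactly in a transcript sequence.
# $1 = read, $2 = transcript sequence, $3 = k
count_kmer_hits() {
    awk -v query="$1" -v target="$2" -v k="$3" 'BEGIN {
        hits = 0
        for (i = 1; i + k - 1 <= length(query); i++)
            if (index(target, substr(query, i, k)) > 0) hits++
        print hits
    }'
}

# Hypothetical read compared against two hypothetical transcripts.
count_kmer_hits ACGTAC TTACGTACGG 4   # every k-mer of the read is present
count_kmer_hits ACGTAC TTACGTTCGG 4   # a single mismatch removes most k-mer hits
```

A read is assigned to the transcript(s) sharing the most k-mers with it, which is why no base-by-base alignment is needed.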
Alignment-based
Alignment-free (quasi-mapping or pseudoalignment)
Quantification: How many reads have come from a genomic feature?
If we had mapped our reads to the genome (rather than the transcript sequences), our mapping would look like this:
As we know the transcript locations (from an annotation file in GFF or GTF format), the simplest approach is to count how many reads overlap each gene.
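The counting idea can be sketched with awk on toy interval files. This is a simplified illustration under stated assumptions: the file layouts, gene names and coordinates are hypothetical, intervals are 1-based and inclusive, and strand, spliced reads and multi-mapping reads are ignored.

```sh
# Count reads overlapping each gene.
# $1 = genes file, columns: gene_id chrom start end   (hypothetical layout)
# $2 = reads file, columns: chrom start end           (hypothetical layout)
count_overlaps() {
    awk '
        # First file: store gene intervals.
        NR == FNR { n++; id[n] = $1; chr[n] = $2; s[n] = $3 + 0; e[n] = $4 + 0; count[n] = 0; next }
        # Second file: a read overlaps a gene if it starts before the gene ends
        # and ends after the gene starts, on the same chromosome.
        {
            for (i = 1; i <= n; i++)
                if ($1 == chr[i] && ($2 + 0) <= e[i] && ($3 + 0) >= s[i]) count[i]++
        }
        END { for (i = 1; i <= n; i++) print id[i] "\t" count[i] }
    ' "$1" "$2"
}

# Toy data: two genes and four reads on one chromosome.
tmp=$(mktemp -d)
printf 'geneA\tchr1\t100\t200\ngeneB\tchr1\t300\t400\n' > "$tmp/genes.tsv"
printf 'chr1\t90\t110\nchr1\t150\t180\nchr1\t250\t290\nchr1\t390\t420\n' > "$tmp/reads.tsv"
count_overlaps "$tmp/genes.tsv" "$tmp/reads.tsv"
```

With this toy input, two reads overlap geneA, one overlaps geneB, and the read at 250-290 falls between genes and is not counted.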
Salmon performs quantification as part of the quasi-mapping process, directly against the transcriptome.
Algorithm accounts for several biases:
Methods like Salmon attempt to mitigate the effect of technical biases by estimating sample-specific bias parameters.
Patro et al. (2017) Nature Methods doi:10.1038/nmeth.4197
Genome/Transcriptome indexing is often required by bioinformatic tools such as aligners.
Similar to a book index, it allows the (pseudo)alignment algorithms to find regions of the large reference genome much more quickly.
Salmon:
Index is built for the transcriptome reference
But we also include the reference genome as a “decoy”
Example code:
# concatenate FASTA files for the transcriptome and genome
cat transcripts.fasta genome.fasta > gentrome.fasta
# create a text file with the names of the genomic sequences (i.e. chromosomes/scaffolds)
# (for a gzip-compressed genome, read it with zcat instead)
grep ">" genome.fasta | cut -d " " -f 1 | sed 's/>//' > decoys.txt
# run the indexing
salmon index -t gentrome.fasta -d decoys.txt -i salmon_index
See this tutorial for details.
Once we have the index, we can proceed with quasi-mapping and quantification:
salmon quant \
-i salmon_index \
-l A \
-1 SAMPLE_R1.fastq.gz \
-2 SAMPLE_R2.fastq.gz \
-o output/directory/SAMPLE \
--gcBias --seqBias
Using --gcBias and --seqBias options accounts for the biases mentioned earlier
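With several samples, the same command is usually repeated per sample. The sketch below is a dry run that only prints one command per sample, so it can be checked before anything is executed. The function name, the sample names and the fastq/ and salmon_output/ directories are hypothetical; the Salmon options are the ones shown above.

```sh
# Print one salmon quant command per sample name given as argument (dry run).
# Assumes paired-end files named fastq/<sample>_R1.fastq.gz and _R2.fastq.gz (hypothetical layout).
salmon_commands() {
    for sample in "$@"; do
        printf 'salmon quant -i salmon_index -l A -1 fastq/%s_R1.fastq.gz -2 fastq/%s_R2.fastq.gz -o salmon_output/%s --gcBias --seqBias\n' \
            "$sample" "$sample" "$sample"
    done
}

# Review the commands first; pipe the output to `sh` to actually run them.
salmon_commands SAMPLE_A SAMPLE_B
```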
Main output is a tab-delimited (TSV) file for each sample we process:
Name                  Length  EffectiveLength  TPM       NumReads
ENSMUST00000177564.1  16      15.000           0.000000  0.000
ENSMUST00000196221.1  9       9.000            0.000000  0.000
ENSMUST00000179664.1  11      11.000           0.000000  0.000
ENSMUST00000178537.1  12      12.000           0.000000  0.000
ENSMUST00000178862.1  14      13.000           0.000000  0.000
ENSMUST00000179520.1  11      11.000           0.000000  0.000
ENSMUST00000179883.1  16      15.000           0.000000  0.000
ENSMUST00000195858.1  10      10.000           0.000000  0.000
ENSMUST00000179932.1  12      12.000           0.000000  0.000
These are the files we will use in downstream differential expression analysis
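Before moving on, it can help to sanity-check each quantification file (Salmon names it quant.sf inside the sample's output directory). The sketch below counts the transcripts, how many have a non-zero TPM, and the total of the NumReads column. The function name is hypothetical and the toy rows marked TOY_ are made-up values for illustration, not real results.

```sh
# Summarise a Salmon quantification table.
# $1 = path to a quant.sf file (columns: Name Length EffectiveLength TPM NumReads)
summarise_quant() {
    awk -F '\t' '
        NR == 1 { next }                                   # skip the header line
        { n++; reads += $5; if ($4 + 0 > 0) expressed++ }  # column 4 = TPM, column 5 = NumReads
        END { printf "transcripts=%d expressed=%d reads=%.1f\n", n, expressed, reads }
    ' "$1"
}

# Toy input: one row from the table above plus two made-up rows.
quant_file=$(mktemp)
{
    printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n'
    printf 'ENSMUST00000177564.1\t16\t15.000\t0.000000\t0.000\n'
    printf 'TOY_TRANSCRIPT_1\t1000\t850.000\t12.500000\t42.000\n'
    printf 'TOY_TRANSCRIPT_2\t2000\t1850.000\t3.000000\t8.500\n'
} > "$quant_file"
summarise_quant "$quant_file"
```

A sample with very few expressed transcripts or a much lower read total than the others is worth investigating before differential expression analysis.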